This workflow uses the auto_annot tools from besca to newly annotate a scRNAseq dataset based on one or more preannotated datasets. Ideally, these datasets come from a similar tissue and condition.
We use supervised machine learning methods to annotate each individual cell utilizing methods like support vector machines (SVM) or logistic regression.
First, the traning dataset(s) and the testing dataset are loaded from h5ad files or made available as adata objects. Next, the training and testing datasets are corrected using scanorama, and the training datasets are then merged into one anndata object. Then, the classifier is trained utilizing the merged training data. Finally, the classifier is applied to the testing dataset to predict the cell types. If the testing dataset is already annotated (to test the algorithm), a report including confusion matrices can be generated.
import besca as bc
import scanpy as sc
import pkg_resources
Apparently the scv loader makes sure the adata objects are all in comparable format whereas the sc loader loads them as is.
adata_test = bc.datasets.Kotliarov2020_processed()
adata_test_orig = bc.datasets.Kotliarov2020_processed()
adata_train1 = bc.datasets.Granja2019_processed()
Concatenation does not lead to errors when the scv loader is used.
adata_train_list = [adata_train1]
Give your analysis a name.
analysis_name = 'auto_annot_pubimage_trainGtestK' # The analysis name will be used to name the output files
Specify column name of celltype annotation you want to train on.
celltype ='dblabel' # This needs to be a column in the .obs of the training datasets (and test dataset if you want to generate a report)
Choose a method:
method = 'logistic_regression'
Specify merge method. Needs to be either scanorama or naive.
merge = 'scanorama' # We recommend to use scanorama here
Decide if you want to use the raw format or highly variable genes. Raw increases computational time and does not necessarily improve predictions.
use_raw = False # We recommend to use False here
You can choose to only consider a subset of genes from a signature set or use all genes.
genes_to_use = 'all' # We suggest to use all here, but the runtime is strongly improved if you select an appropriate gene set
Column names need to be standardised so the function knows which columns to compare.
adata_train_list[0].obs["dblabel"] = adata_train_list[0].obs.celltype3
adata_test.obs["dblabel"] = adata_test.obs.celltype3
adata_test_orig.obs["dblabel"] = adata_test_orig.obs.celltype3
adata_test.obs.dblabel.unique()
adata_train_list[0].obs.dblabel.unique()
adata_test.var.dtypes
adata_train_list[0].var.dtypes
This function merges training datasets, removes unwanted genes, and if scanorama is used corrects for datasets.
adata_train, adata_test_corrected = bc.tl.auto_annot.merge_data(adata_train_list, adata_test, genes_to_use = genes_to_use, merge = merge)
The returned scaler is fitted on the training dataset (to zero mean and scaled to unit variance). The scaling will then be applied to the counts in the testing dataset and then the classifier is applied to the scaled testing dataset (see next step, adata_predict()). This function will run multiple jobs in parallel if if logistic regression was specified as method.
classifier, scaler = bc.tl.auto_annot.fit(adata_train, method, celltype, njobs=10)
If in addition to the most likely class you would like to have all class probabilities returned use the following function. (This is only a sensible choice if using logistic regression.)
adata_predicted = bc.tl.auto_annot.adata_pred_prob(classifier = classifier, scaler = scaler, adata_pred = adata_test_corrected, adata_orig = adata_test_orig, threshold = 0.0)
The adata object that includes the predicted cell type annotation can be written out as h5ad file.
adata_predicted.write('./adata_predicted_trainGtestK.h5ad')
If the testing dataset included already a cell type annotation, a report can be generated and written, which includes metrics, confusion matrices and comparative umap plots.
adata_predicted.obs
adata_predicted = bc.st.clustering(adata_predicted, '.')
%matplotlib inline
sc.settings.set_figure_params(dpi=90)
bc.tl.report(adata_pred=adata_predicted, celltype=celltype, method=method, analysis_name=analysis_name,
train_datasets=adata_train_list, test_dataset=adata_test_orig, merge=merge, use_raw=False,
genes_to_use=genes_to_use, remove_nonshared=True, clustering='leiden', asymmetric_matrix=True)
sc.settings.set_figure_params(dpi=240)
sc.pl.umap(adata_predicted, color=[celltype, 'auto_annot', 'leiden'], legend_loc='on data',legend_fontsize=7, save= '.fig4_supp_trainGtestKondata.svg')
sc.pl.umap(adata_predicted, color=[celltype, 'auto_annot', 'leiden'],legend_fontsize=7, wspace = 1.4, save = '.fig4_supp_trainGtestK.svg')
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(y_true, y_pred, classes, celltype,
normalize=False,
title=None, numbers =False,
cmap=plt.cm.Blues, adata_predicted= None, asymmetric_matrix = True):
matplotlib.use('Agg')
if not title:
if normalize:
title = 'Normalized confusion matrix'
else:
title = 'Confusion matrix, without normalization'
# Compute confusion matrix
cm = confusion_matrix(y_true, y_pred)
# Only use the labels that appear in the data
#classes = classes[unique_labels(y_true, y_pred)]
if asymmetric_matrix == True:
class_names = np.unique(np.concatenate((adata_predicted.obs[celltype], adata_predicted.obs['auto_annot'])))
class_names_orig = np.unique(adata_predicted.obs[celltype])
class_names_pred = np.unique(adata_predicted.obs['auto_annot'])
test_celltypes_ind = np.searchsorted(class_names, class_names_orig)
train_celltypes_ind = np.searchsorted(class_names, class_names_pred)
cm=cm[test_celltypes_ind,:][:,train_celltypes_ind]
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
fig, ax = plt.subplots(figsize=(15,15))
im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
ax.figure.colorbar(im, ax=ax, shrink = 0.8)
# We want to show all ticks...
if asymmetric_matrix == True:
ax.set(xticks=np.arange(cm.shape[1]),
yticks=np.arange(cm.shape[0]),
# ... and label them with the respective list entries
xticklabels=class_names_pred, yticklabels=class_names_orig,
title=title,
ylabel='True label',
xlabel='Predicted label')
else:
ax.set(xticks=np.arange(cm.shape[1]),
yticks=np.arange(cm.shape[0]),
# ... and label them with the respective list entries
xticklabels=classes, yticklabels=classes,
title=title,
ylabel='True label',
xlabel='Predicted label')
ax.grid(False)
#ax.tick_params(axis='both', which='major', labelsize=10)
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
if numbers == True:
fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
for i in range(cm.shape[0]):
for j in range(cm.shape[1]):
ax.text(j, i, format(cm[i, j], fmt),
ha="center", va="center",
color="white" if cm[i, j] > thresh else "black")
#fig.tight_layout()
return ax
import os
# make conf matrices (4)
class_names = np.unique(np.concatenate((adata_predicted.obs[celltype], adata_predicted.obs['auto_annot'])))
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plot_confusion_matrix(adata_predicted.obs[celltype], adata_predicted.obs['auto_annot'], title = " ", classes=class_names, celltype=celltype ,numbers = False, adata_predicted = adata_predicted, asymmetric_matrix = True)
plt.savefig(os.path.join('fig4_supp_trainGtestK_confusion_matrix_nonnormalised.svg'))
# Plot normalized confusion matrix with numbers
plot_confusion_matrix(adata_predicted.obs[celltype], adata_predicted.obs['auto_annot'], title = " ", classes=class_names,celltype=celltype, normalize=True, numbers = False, adata_predicted = adata_predicted, asymmetric_matrix = True)
plt.savefig(os.path.join('fig4_supp_trainGtestK_confusion_matrix_normalised.svg'))
analysis_name = 'auto_annot_pubimage_threshold_trainGtestK' # The analysis name will be used to name the output files
adata_predicted_threshold = bc.tl.auto_annot.adata_pred_prob(classifier = classifier, scaler = scaler, adata_pred = adata_test_corrected, adata_orig = adata_test_orig, threshold = 0.7)
adata_predicted_threshold.write('./adata_predicted_threshold_trainGtestK.h5ad')
%matplotlib inline
sc.settings.set_figure_params(dpi=90)
bc.tl.report(adata_pred=adata_predicted_threshold, celltype=celltype, method=method, analysis_name=analysis_name,
train_datasets=adata_train_list, test_dataset=adata_test_orig, merge=merge, use_raw=False,
genes_to_use=genes_to_use, remove_nonshared=True, clustering='leiden', asymmetric_matrix=True)
sc.settings.set_figure_params(dpi=240)
sc.pl.umap(adata_predicted_threshold, color=[celltype, 'auto_annot', 'leiden'], legend_loc='on data',legend_fontsize=7, save= '.fig4_supp_trainGtestK_threshold_ondata.svg')
sc.pl.umap(adata_predicted_threshold, color=[celltype, 'auto_annot', 'leiden'],legend_fontsize=7, wspace = 1.4, save = '.fig4_supp_trainGtestK_threshold.svg')
# make conf matrices (4)
class_names = np.unique(np.concatenate((adata_predicted_threshold.obs[celltype], adata_predicted_threshold.obs['auto_annot'])))
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plot_confusion_matrix(adata_predicted_threshold.obs[celltype], adata_predicted_threshold.obs['auto_annot'], title = " ", classes=class_names, celltype=celltype ,numbers = False, adata_predicted = adata_predicted_threshold, asymmetric_matrix = True)
plt.savefig(os.path.join('fig4_supp_trainG_testK_confusion_matrix_threshold_nonnormalised.svg'))
# Plot normalized confusion matrix with numbers
plot_confusion_matrix(adata_predicted_threshold.obs[celltype], adata_predicted_threshold.obs['auto_annot'], title = " ", classes=class_names,celltype=celltype, normalize=True, numbers = False, adata_predicted = adata_predicted_threshold, asymmetric_matrix = True)
plt.savefig(os.path.join('fig4_supp_trainG_testK_confusion_matrix_threshold_normalised.svg'))
adata_predicted_wo_unknown = adata_predicted_threshold.copy()
adata_predicted_wo_unknown = bc.subset_adata(adata_predicted_wo_unknown, adata_predicted_wo_unknown.obs.auto_annot != 'unknown', raw=False)
bc.pl.riverplot_2categories(adata_predicted_wo_unknown, [celltype, 'auto_annot'])
gmt_file_IMM=pkg_resources.resource_filename('besca', 'datasets/genesets/HumanCD45p_scseqCMs6.gmt')
bc.tl.sig.combined_signature_score(adata_predicted, gmt_file_IMM)
adata_predicted.var_names
scores = [x for x in adata_predicted.obs.columns if 'CD45' in x]
scores
Indeed it seems like the classification of B cells is an improvement, whereas the varieties of T cells pose difficulties.
sc.pl.umap(adata_predicted, color= ["score_HumanCD45p_scseqCMs6_MemB_scanpy", "score_HumanCD45p_scseqCMs6_NaiveB_scanpy","CD4", "CD8A"], ncols = 2, wspace = 0.4, color_map = 'viridis',save= '.fig4_markers.svg')
sc.pl.umap(adata_predicted, color= ["IL7R"], ncols = 2, wspace = 0.4, color_map = 'viridis')